Data Loading and Cleaning

Next we will load the loan CSV and do the following:

1) Find all columns that contain only null/NaN values and drop them, as they give no input.

2) Use the null-value summary to identify those columns and filter them out of the main data.
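The two steps above can be sketched as follows. The file name and the tiny inline frame are illustrative; the real notebook would read the actual loan CSV.

```python
import pandas as pd

# Stand-in for pd.read_csv("loan.csv") so the sketch is self-contained.
loan = pd.DataFrame({
    "id": [1, 2, 3],
    "loan_amnt": [5000, 10000, 2400],
    "empty_col": [None, None, None],   # entirely null -> carries no information
})

# Find columns whose values are all null and drop them from the main data.
null_counts = loan.isnull().sum()
all_null_cols = null_counts[null_counts == len(loan)].index.tolist()
loan = loan.drop(columns=all_null_cols)
```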

Now we will start data cleaning column by column. The columns below have a large percentage of null values:

1) mths_since_last_record --> 92% null values

2) next_pymnt_d --> 97% null values

Variables with less business significance:

3) member_id --> a random number, and we already have id as the unique key; no input for making Charged Off decisions

4) funded_amnt_inv --> the loan is already funded, and investors are not going to get the money back :)

5) sub_grade --> redundant with the grade column

6) issue_d --> the loan is already funded

7) zip_code --> hard to process into a meaningful form within this analysis

8) out_prncp --> informative for Current loans rather than Charged Off cases

9) out_prncp_inv --> informative for Current loans rather than Charged Off cases

10) total_pymnt_inv --> redundant with total_pymnt

11) total_rec_prncp, total_rec_int --> informative for Current loans rather than Charged Off cases
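A sketch of these drops: a 90% threshold catches the two high-null columns (an assumption matching the 92%/97% figures above), and the low-significance columns are dropped by name.

```python
import pandas as pd

# Tiny stand-in frame; only a few of the listed columns are shown.
loan = pd.DataFrame({
    "id": [1, 2],
    "member_id": [11, 12],
    "mths_since_last_record": [None, None],
    "loan_amnt": [5000, 10000],
})

# Columns with a very high share of nulls (threshold is an assumption).
high_null = [c for c in loan.columns if loan[c].isnull().mean() > 0.90]

# Columns with less business significance, dropped by name if present.
low_significance = [c for c in ["member_id", "funded_amnt_inv", "sub_grade",
                                "issue_d", "zip_code", "out_prncp",
                                "out_prncp_inv", "total_pymnt_inv",
                                "total_rec_prncp", "total_rec_int"]
                    if c in loan.columns]

loan = loan.drop(columns=high_null + low_significance)
```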

Replace nulls in the text columns with known placeholder strings instead of dropping those rows; then, for the remaining columns that still have null values, drop the corresponding rows.
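A minimal sketch of this two-step treatment (the column names and the "Unknown" placeholder are illustrative assumptions):

```python
import pandas as pd

loan = pd.DataFrame({
    "emp_title": ["Engineer", None, "Teacher"],
    "annual_inc": [60000.0, 45000.0, None],
})

# Fill nulls in text columns with a known string rather than dropping rows.
text_cols = loan.select_dtypes(include="object").columns
loan[text_cols] = loan[text_cols].fillna("Unknown")

# For the remaining (numeric) columns, drop rows that still hold nulls.
loan = loan.dropna()
```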

Analysis of mths_since_last_delinq revealed that the null values in this column belong to Charged Off cases, so imputing the column with the standard deviation or the median should be fine. We prefer the standard deviation so that we don't put too much penalty on the 4% of null values that fall in non-default cases.
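A sketch of the imputation described above, filling the nulls with the column's standard deviation (the choice of std over median follows the reasoning in the text):

```python
import pandas as pd

loan = pd.DataFrame({"mths_since_last_delinq": [10.0, 30.0, None, 50.0]})

# Impute nulls with the standard deviation of the observed values.
std_val = loan["mths_since_last_delinq"].std()
loan["mths_since_last_delinq"] = loan["mths_since_last_delinq"].fillna(std_val)
```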

Now we will start conditioning/value treatments for the respective columns after looking at sample data from each column: remove the % sign from percentage values, impute values, and, specifically for employee experience, convert the categories into 10 buckets.
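The two main treatments can be sketched as follows. Mapping "< 1 year" to 0 and "10+ years" to 10 is an assumption about how the buckets are defined:

```python
import pandas as pd

loan = pd.DataFrame({
    "int_rate": ["10.65%", "15.27%"],
    "emp_length": ["< 1 year", "10+ years"],
})

# Strip the trailing % and convert the rate to a float.
loan["int_rate"] = loan["int_rate"].str.rstrip("%").astype(float)

# Convert employment-length categories to numeric buckets 0..10.
loan["emp_length"] = (loan["emp_length"]
                      .replace({"< 1 year": "0"})
                      .str.extract(r"(\d+)")[0]
                      .astype(int))
```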

1) First take the categorical variables and check whether each one holds a single constant value; if so, exclude it, as it will give us no information.

2) Build a dataframe for the univariate analysis of all the continuous variables, and check whether a column holds a single value such as 0.0 or 0; drop such columns, as they give us no information for the analysis.

3) Filter out desc, url, title and emp_title, as they would require sentiment-analysis-style work that is out of scope here.
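The three steps above can be sketched together: drop any constant column and the listed free-text columns.

```python
import pandas as pd

loan = pd.DataFrame({
    "policy_code": [1, 1, 1],          # constant -> no information
    "loan_amnt": [5000, 10000, 2400],
    "url": ["a", "b", "c"],            # free text, out of scope here
})

# Columns holding a single constant value (categorical or continuous).
constant_cols = [c for c in loan.columns if loan[c].nunique(dropna=False) == 1]

# Free-text columns that would need sentiment-style processing.
text_cols = [c for c in ["desc", "url", "title", "emp_title"]
             if c in loan.columns]

loan = loan.drop(columns=constant_cols + text_cols)
```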

We can see that most of the loans have Fully Paid status, while a sizeable chunk are currently running (Current) loans. We cannot analyse Current loans because they have not yet had the chance to default, so let's remove them.
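Removing the Current loans is a one-line filter:

```python
import pandas as pd

loan = pd.DataFrame({"loan_status": ["Fully Paid", "Current", "Charged Off"]})

# Keep only completed loans; Current loans have not yet had a chance to default.
loan = loan[loan["loan_status"] != "Current"]
```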

Univariate Analysis

1) Perform univariate analysis (for categorical and continuous) on the individual columns: remove outliers and examine the difference between the mean and the 50th percentile (median).

2) Derive new columns as and when required.

Start analysing the categorical variables, first drawing a count plot of loans applied for in each category alongside the charged-off % of each category.
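The quantities behind those two plots can be computed as below; in the notebook each series would feed a bar/count plot (e.g. seaborn's `countplot`/`barplot`). The sample categories are illustrative.

```python
import pandas as pd

loan = pd.DataFrame({
    "purpose": ["car", "car", "small_business", "small_business"],
    "loan_status": ["Fully Paid", "Fully Paid", "Charged Off", "Fully Paid"],
})

# Count of loans applied for in each category.
counts = loan["purpose"].value_counts()

# Charged-off % for each category.
charged_off_pct = (loan.groupby("purpose")["loan_status"]
                   .apply(lambda s: (s == "Charged Off").mean() * 100))
```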

While analysing a categorical column, some categories may appear in the count plot but be missing from the default-% plot; the reason is that those categories have no Charged Off cases.

From the above plots we can clearly see the following:

  1. emp_length -> not much variation in default rates
  2. home_ownership -> no visible trend
  3. verification_status -> Verified loans default more than the other categories
  4. purpose -> small business defaults more than the other categories
  5. states -> not looking very useful so far
  6. grades -> the number of defaults increases across categories from A to G
  7. pub_rec_bankruptcies -> defaults increase with the number of bankruptcies
  8. term -> more defaults on longer-term loans
  9. number_of_der_pub_records -> more defaults with a value of 1
  10. policy_code -> not useful
  11. delinq_2years -> increasing trend
  12. open_acc -> most defaults with a larger number of open accounts
  13. total_acc -> more defaults with a large number of total accounts

From the loan amount distribution we can see that the median (50th percentile) of the loan amount is 10000. Let's use this to make some bins and analyse the default rates across different amounts.
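A sketch of the binning: the bin edges around the 10,000 median are an illustrative assumption.

```python
import pandas as pd

loan = pd.DataFrame({
    "loan_amnt": [3000, 8000, 12000, 20000, 30000],
    "loan_status": ["Fully Paid", "Fully Paid", "Charged Off",
                    "Charged Off", "Charged Off"],
})

# Bin the loan amount around the 10,000 median.
loan["loan_amnt_bin"] = pd.cut(
    loan["loan_amnt"],
    bins=[0, 5000, 10000, 15000, 25000, 40000],
    labels=["0-5k", "5-10k", "10-15k", "15-25k", "25k+"],
)

# Default rate (charged-off %) per bin.
default_rate = (loan.groupby("loan_amnt_bin", observed=False)["loan_status"]
                .apply(lambda s: (s == "Charged Off").mean() * 100))
```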

We can clearly see that higher the loan amount, more are the chances of default.

Date-type analysis to check the charged-off % and the total amount funded.

Inferences from univariate analysis

Selecting the variables which might give good insights into the data:

1) emp_length --> people with longer experience are showing more defaults

2) home_ownership --> one category shows 19% defaults; use in bivariate analysis

3) verification_status --> one category, Verified, shows 16% defaults

4) purpose --> the small business category shows > 25% defaults

5) state --> a clear power-law distribution, dominated by CA, NY, FL, TX

6) grade --> some grades show around 30% defaults

7) last_pymnt_year --> some years show around 20%-40% defaults, including 2009, the recession year

8) int_rate --> split into multiple bins and analysed further

9) month_saved_amnt --> a derived feature combining installment and annual_inc, both of which showed good normal distributions

10) total_pymnt, revol_bal, loan_amnt --> promising distributions for analysis

Dropping the variables below, which showed little significance for the Charged Off data:

1) revol_util, dti --> fairly flat distribution/histogram plots

2) total_rec_late_fee, recoveries, collection_recovery_fee, last_pymnt_amnt --> all show 0 values for the charged-off loan data

3) earliest_cr_line --> the power-law distribution shows that only the LHS and RHS carry the bulk of the loans, while the centre has few loans and little data

4) last_pymnt_d --> converted to last_pymnt_year

5) last_credit_pull_d --> gives the same insights as last_pymnt_year
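These drops can be sketched in one pass (only a few of the listed columns are shown in the stand-in frame):

```python
import pandas as pd

loan = pd.DataFrame({
    "revol_util": [0.3, 0.5],
    "dti": [10.0, 20.0],
    "grade": ["A", "B"],
})

# Variables that showed little significance for the Charged Off analysis.
to_drop = [c for c in ["revol_util", "dti", "total_rec_late_fee", "recoveries",
                       "collection_recovery_fee", "last_pymnt_amnt",
                       "earliest_cr_line", "last_pymnt_d",
                       "last_credit_pull_d"]
           if c in loan.columns]
loan = loan.drop(columns=to_drop)
```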

Segmented Univariate Analysis

Now, making use of the above two plots, we can see where most of the defaults are concentrated.

Let's now check the distribution of default percentage across the features identified in the univariate analysis, i.e.

emp_length, home_ownership, verification_status, grade, last_pymnt_year, int_rate_grp, loan_amnt_bin
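The segmented view can be sketched as a loop over those features, computing the charged-off % per category; in the notebook each resulting series would be plotted as a bar chart. Only two features are shown here for brevity.

```python
import pandas as pd

loan = pd.DataFrame({
    "grade": ["A", "A", "G", "G"],
    "term": ["36 months", "60 months", "60 months", "60 months"],
    "loan_status": ["Fully Paid", "Fully Paid", "Charged Off", "Charged Off"],
})

# In the notebook: emp_length, home_ownership, verification_status, grade, ...
features = ["grade", "term"]

# Charged-off % per category, for each selected feature.
default_pct = {
    f: (loan.groupby(f)["loan_status"]
        .apply(lambda s: (s == "Charged Off").mean() * 100))
    for f in features
}
```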

Bivariate Analysis